Multi-class composite n-gram language model using multiple word clusters and word successions
نویسندگان
چکیده
In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid a data sparseness problem in small amount of training data. The Multi-Class Composite Ngram maintains an accurate word prediction capability and reliability for sparse data with a compact model size based on multiple word clusters, so-called Multi-Classes. In the Multi-Class, the statistical connectivity at each position of the N-grams is regarded as word attributes, and one word cluster each is created to represent positional attributes. Furthermore, by introducing higher order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams. In experiments, the MultiClass Composite N-grams result in 9.5% lower perplexity and a 16% lower word error rate in speech recognition with a 40% smaller parameter size than conventional word 3-grams.
منابع مشابه
Multi-Class Composite N-gram Language Model for Spoken Language Processing Using Multiple Word Clusters
In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid a data sparseness problem for spoken language in that it is difficult to collect training data. The Multi-Class Composite N-gram maintains an accurate word prediction capability and reliability for sparse data with a compact model size based on multiple word clusters, called MultiClasses. In the Multi-Cl...
متن کاملMulti-class composite N-gram based on connection direction
A new word-clustering technique is proposed to efficiently build statistically salient class 2-grams from language corpora. By splitting word neighboring characteristics into word-preceding and following directions, multiple (two-dimensional) word classes are assigned to each word. In each side, word classes are merged into larger clusters independently according to preceding or following word ...
متن کاملNew language models using phrase structures extracted from parse trees
This paper proposes a new speech recognition scheme using three linguistic constraints. Multi-class composite bigram models [1] are used in the first and second passes to reflect word-neighboring characteristics as an extension of conventional word n-gram models. Trigram models with constituent boundary markers and word pattern models are both used in the third pass to utilize phrasal constrain...
متن کاملClass-based language model adaptation using mixtures of word-class weights
This paper describes the use of a weighted mixture of classbased n-gram language models to perform topic adaptation. By using a fixed class n-gram history and variable word-given-class probabilities we obtain large improvements in the performance of the class-based language model, giving it similar accuracy to a word n-gram model, and an associated small but statistically significant improvemen...
متن کاملLooking at alternatives within the framework of n-gram based language modeling for spontaneous speech recognition
This paper presents different methods using a weighted mixture of word and word-class language models in order to perform language model adaptation. A general language model is built from the whole training corpus, then several numbers of clusters are created according to a word co-occurrence measure and finally, word models as well as word-class models are built from each cluster. The general ...
متن کامل